feat: re-land graceful scale-down for AlluxioRuntime with e2e test by jakharmonika364 · Pull Request #6061 · fluid-cloudnative/fluid

jakharmonika364 · 2026-06-25T04:34:31Z

Re-lands #5805, reverted in #6059 because it had no e2e coverage of the actual scale-down -> drain -> decommission -> data-integrity path.

This PR adds that e2e test under test/gha-e2e/alluxio-scaledown/: binds a 2-replica AlluxioRuntime, reads data, scales down to 1 replica, waits for the worker StatefulSet to actually converge, then reads again.

While wiring this up I found GracefulWorkerScaleDown was registered as a feature gate but never exposed as a flag on the alluxioruntime-controller binary (cmd/alluxio/app/alluxio.go), unlike csi and fluidapp, so it was unreachable in any real deployment. Fixed that too - the e2e test enables it via a deployment patch since there's no Helm value for it yet.

Original feature code is unchanged from #5805.

fluid-e2e-bot · 2026-06-25T04:34:45Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yangyuliufeng for approval by writing /assign @yangyuliufeng in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

fluid-e2e-bot · 2026-06-25T04:34:49Z

Hi @jakharmonika364. Thanks for your PR.

I'm waiting for a fluid-cloudnative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

gemini-code-assist

Code Review

This pull request introduces a new end-to-end test suite for testing the graceful scale-down of AlluxioRuntime, including the test script and associated Kubernetes resource manifests. It also exposes the feature gate flags in the Alluxio controller command. The review feedback highlights a critical missing blank import required to register the feature gates, and suggests improvements to the E2E test script, such as simplifying the wait-loop logging logic and properly scoping variables as local to prevent global namespace pollution.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

jakharmonika364 · 2026-06-25T04:38:52Z

Fixed the wait_dataset_bound and wait_job_completed issues. The blank import suggestion isn't needed though - pkg/ddc/alluxio already imports pkg/features directly (replicas.go), and cmd/alluxio/app/alluxio.go imports pkg/ddc/alluxio, so the init() already runs. Verified --feature-gates works via go run ./cmd/alluxio start --help.

codecov · 2026-06-25T04:48:20Z

Codecov Report

❌ Patch coverage is 88.69048% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.89%. Comparing base (36f0467) to head (e8c3871).

Files with missing lines	Patch %	Lines
pkg/ddc/alluxio/replicas.go	87.87%	11 Missing and 5 partials ⚠️
pkg/features/features.go	0.00%	2 Missing ⚠️
cmd/alluxio/app/alluxio.go	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6061      +/-   ##
==========================================
+ Coverage   64.77%   64.89%   +0.11%     
==========================================
  Files         484      486       +2     
  Lines       33892    34058     +166     
==========================================
+ Hits        21954    22101     +147     
- Misses      10215    10229      +14     
- Partials     1723     1728       +5

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot

Pull request overview

Re-lands the AlluxioRuntime graceful worker scale-down feature by adding a new Kind-based GitHub Actions e2e scenario and exposing the --feature-gates flag on the alluxioruntime-controller binary so the gate can be enabled in real deployments.

Changes:

Add a new test/gha-e2e/alluxio-scaledown/ e2e scenario that scales an AlluxioRuntime from 2→1 and verifies data access before/after.
Expose feature gates on alluxioruntime-controller via DefaultMutableFeatureGate.AddFlag(...).
Wire the new e2e scenario into .github/scripts/gha-e2e.sh so it runs in CI.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
test/gha-e2e/alluxio-scaledown/test.sh	New e2e driver script that enables the gate, creates runtime, scales down, and re-reads data
test/gha-e2e/alluxio-scaledown/read_before_job.yaml	Job used to validate data access before scale-down
test/gha-e2e/alluxio-scaledown/read_after_job.yaml	Job used to validate data access after scale-down
test/gha-e2e/alluxio-scaledown/dataset.yaml	Dataset + 2-replica AlluxioRuntime manifest for the scale-down scenario
cmd/alluxio/app/alluxio.go	Adds `--feature-gates` flag exposure for the alluxio controller binary
.github/scripts/gha-e2e.sh	Runs the new alluxio scale-down e2e scenario in the CI e2e sequence

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…dvancedStatefulSet (fluid-cloudnative#4193) (fluid-cloudnative#5805)" (fluid-cloudnative#6059) This reverts commit 333e13a. Signed-off-by: Monika Jakhar <jakharmonika364@gmail.com>

Adds a real-cluster scenario exercising the scale-down -> drain -> decommission -> data-integrity path: bind a 2-replica AlluxioRuntime, read data, scale down to 1 replica, wait for the worker StatefulSet to actually converge (proving the drain doesn't stall), then read again. Also wires --feature-gates into the alluxioruntime-controller binary (cmd/alluxio/app/alluxio.go), mirroring cmd/csi and cmd/fluidapp - without it GracefulWorkerScaleDown was registered but unreachable on a real deployment, so the e2e test enables it via a deployment patch. Signed-off-by: Monika Jakhar <jakharmonika364@gmail.com>

Simplify wait_dataset_bound's elapsed-time tracking to a single counter (matches wait_worker_replicas), and declare succeed/job_failed as local in wait_job_completed to avoid polluting global scope. Signed-off-by: Monika Jakhar <jakharmonika364@gmail.com>

- check kubectl patch exit codes instead of letting failures surface as confusing downstream timeouts - log the WorkerDecommissioning condition while polling so a stuck drain is debuggable from CI output - replace the external apache.org mirror mount with an in-cluster MinIO bucket seeded with one known object, and assert its exact content before/after scale-down instead of just checking the directory is non-empty - the previous assertion would pass even if decommission silently dropped cached data instead of migrating it Signed-off-by: Monika Jakhar <jakharmonika364@gmail.com>

jakharmonika364 · 2026-06-25T10:24:14Z

Found why CI was failing: this branch forked from master before #6059's revert merged, so the PR's merge-ref was silently dropping the whole feature (pkg/features, replicas.go logic, etc.) since my commits never touched those files directly. Rebuilt the branch off current master with the revert reverted, then replayed the e2e/flag-fix commits on top - --feature-gates=GracefulWorkerScaleDown=true now actually registers.

jakharmonika364 · 2026-06-25T10:26:24Z

+    - mountPoint: s3://scaledown/
+      name: scaledown
+      options:
+        alluxio.underfs.s3.endpoint: "http://scaledown-minio:9000"


Leaving this as http - it's an in-cluster only MinIO with throwaway test creds, same as curvine's existing endpoint_url: http://minio:9000 fixture. No real traffic to protect here.

Signed-off-by: Monika Jakhar <jakharmonika364@gmail.com>

Alluxio mounts s3://scaledown/ at /scaledown inside its namespace (named after the Dataset mount's name), not at the FUSE root. The existing alluxio test never noticed this because its assertion just checks "ls /data" is non-empty, which passes either way. Use the correct /data/scaledown/fixture.txt path so the content check actually exercises the fixture. Signed-off-by: Monika Jakhar <jakharmonika364@gmail.com>

sonarqubecloud · 2026-06-25T10:48:22Z

Quality Gate passed

Issues
32 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
2.2% Duplication on New Code

See analysis details on SonarQube Cloud

fluid-e2e-bot Bot added the needs-ok-to-test label Jun 25, 2026

gemini-code-assist Bot reviewed Jun 25, 2026

View reviewed changes

Comment thread cmd/alluxio/app/alluxio.go

Comment thread test/gha-e2e/alluxio-scaledown/test.sh

Comment thread test/gha-e2e/alluxio-scaledown/test.sh

jakharmonika364 force-pushed the feat-alluxio-scaledown-e2e branch from 656ceff to 5860b3b Compare June 25, 2026 04:40

jakharmonika364 mentioned this pull request Jun 25, 2026

Revert "feat: support graceful scale-down for AlluxioRuntime" (#5805) #6059

Merged

cheyang requested review from TrafalgarZZZ and Copilot June 25, 2026 06:01

Copilot started reviewing on behalf of cheyang June 25, 2026 06:01 View session

Copilot AI reviewed Jun 25, 2026

View reviewed changes

Comment thread test/gha-e2e/alluxio-scaledown/test.sh Outdated

Comment thread test/gha-e2e/alluxio-scaledown/test.sh Outdated

Comment thread test/gha-e2e/alluxio-scaledown/test.sh